Preparation

Install packages

You need to do this only once!
The script below checks whether the listed CRAN packages are installed on your computer. If they are not, they will be automatically installed.

## First specify the packages of interest
packages = c("devtools","hdf5r","dplyr","ggplot2","stringr",
             "RColorBrewer","useful","readr","BiocManager")

## Now load or install&load all
package.check <- lapply(
  packages,
  FUN = function(x) {
    if (!require(x, character.only = TRUE)) {
      install.packages(x, dependencies = TRUE)
      library(x, character.only = TRUE)
    }
  }
)

Seurat for scRNAseq data analysis

install.packages('Seurat')

The following is necessary for interaction with hdf5 objects, and can only be installed using BiocManager installer.

# You need this only once
BiocManager::install("rhdf5")

1. Introduction

In this COO, you will learn how to analyse single-cell RNA sequencing (scRNAseq) data using the well-established Seurat pipeline.

This pipeline includes functions that do most of the work ‘behind the scenes’. This makes the tool very user-friendly and suitable for today’s tutorial. There is extensive documentation on the use of the pipeline. The following tutorial is the closest to what we will do today: https://satijalab.org/seurat/articles/pbmc3k_tutorial.html

Set working directory

You may need to run this in the terminal, since in RStudio the working directory is reset to your home directory after the chunk is run! An alternative is to use the pull-down menu at the top of the bottom-right window in RStudio: click Files/More to see the option.

setwd("/Users/onurbasak/Documents/1_BCRM/11 teaching/Bachelor/Bioinformatics_BMW/2023/COO/")

Load libraries

library(dplyr,verbose = FALSE)
library(ggplot2,verbose = FALSE)
library(stringr,verbose = FALSE)
library(Seurat,verbose = FALSE)
library(RColorBrewer,verbose = FALSE)
library(useful,verbose = FALSE)
library(readr,verbose = FALSE)
library(hdf5r,verbose = FALSE)

2. Seurat analysis

For today’s tutorial, we will use the scRNAseq atlas of the adolescent mouse cortex published by the Linnarsson lab (Zeisel et al. 2018). This extensive dataset is available at mousebrain.org, which we briefly discussed at the end of the lecture. There is an online tool with which you can browse the data; you can use it, for instance, to get inspiration for your practice questions.

Aim

Process the scRNAseq data of the adolescent mouse cortex in order to reveal the major cell clusters and identify cell types.

Why? A major use of scRNAseq is cell type identification. To achieve this, you need to perform quality control steps and cluster analysis. To visualize the data, you will perform dimensionality reduction. Finally, you can plot marker genes that you find in the literature to reveal the cell type identity of the clusters.

It is time for the scRNAseq analysis! We will use the Seurat object that we have uploaded. This object, made specially for the Seurat pipeline, has many ‘slots’ where information can be stored, and its architecture is a bit complicated. You do not need to know it for this tutorial, beyond what is mentioned here.

2.1 The dataset

The data (‘Linnerson_cortex_10X22.rds’) was downloaded and processed into a ‘Seurat object’ to prevent technical errors caused by the ‘loom’ file format on several computers.

You can download it from Blackboard, in the COO/data folder, or from the following GitHub page made for this course: https://github.com/neurOnur/BMW_scRNAseq_COO_2023

2.1.1 Load the dataset

Load the Seurat object that was saved as rds.

dataset <- readRDS(file = 'data/Linnerson_cortex_10X22.rds')
dataset
## An object of class Seurat 
## 27998 features across 6658 samples within 1 assay 
## Active assay: RNA (27998 features, 0 variable features)

Note: The object contains data from 6658 cells (samples) and 27998 features (genes). There is 1 assay (RNA).

2.1.2 Downsize the dataset

To save time, we can downsample the Seurat object to at most 1000 random cells per identity class. For this, we can use the subset() function.

# Run only if you have performace issues
# dataset_backup = dataset
# dataset <- subset(dataset, downsample = 1000)
# dataset

2.1.3 Check the metadata

This is where ‘cell level’ information is stored. This means there is one value for each cell.

knitr::kable(head(dataset@meta.data[,1:6]), digits = 6)
orig.ident nCount_RNA nFeature_RNA Age AnalysisPool AnalysisProject
cortex1_10X22_2_ACAATAACTGTAGC-1 10X22 5030 1942 p23 Cortex1 Adolescent
cortex1_10X22_1_AGGTTCGAAACGTC-1 10X22 2122 1010 p23 Cortex1 Adolescent
cortex1_10X22_1_TGACTTTGGCATAC-1 10X22 2677 1219 p23 Cortex1 Adolescent
cortex1_10X22_2_GTGAGGGACGTAAC-1 10X22 2648 1232 p23 Cortex1 Adolescent
cortex1_10X22_2_CACCGTTGGACGTT-1 10X22 4586 1939 p23 Cortex1 Adolescent
cortex1_10X22_2_GAGCGAGAACGTAC-1 10X22 5807 2292 p23 Cortex1 Adolescent

2.2 Quality metrics

2.2.1 Plot some quality metrics

An important metric is the number of RNA molecules (nCount_RNA) and genes (nFeature_RNA) per cell. These are automatically calculated when the Seurat object is generated from a data matrix.

VlnPlot(object = dataset, features = c("nCount_RNA", "nFeature_RNA"), cols = "blue",
                pt.size = .01)

2.2.2 Calculate additional QC metrics

Start by generating QC metrics in addition to the number of genes/features.

Mitochondrial RNA is the mRNA that is generated by the mitochondrial genome. Normally, these constitute a small fraction of the total mRNA. However, in dying or damaged cells, while cytoplasmic/nuclear mRNA degrades rapidly, mitochondrial mRNA is rather well preserved. Thus, a high ratio of mitochondrial mRNA indicates BAD cell quality.

mRNA coding for the Ribosomal subunit proteins is abundant (not to be confused with rRNA, which does not code for protein but is a part of the ribosome complex). Usually, a high ribosomal RNA percentage indicates production of a lot of proteins, and is very high in dividing cells or some secretory cells that need to constantly produce proteins. However, if most of the mRNA (>30-50%) that we detect is ribosomal, it means that the valuable information in this cell would be very limited and that we should exclude it from the analysis.

dataset <- PercentageFeatureSet(dataset,pattern='^mt-', col.name = "percent.mt")
dataset <- PercentageFeatureSet(dataset, pattern = '^Rp[sl]', col.name = "percent.ribo") # anchored so only genes starting with Rps/Rpl match
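To see which genes a pattern actually captures, you can match the regular expression against the gene names directly. The sketch below assumes standard mouse gene symbols (e.g. mt-Nd1, Rps19); adjust the patterns if your annotation differs.

```r
## Sketch: list the genes each regular expression captures
## (assumes standard mouse gene symbols such as mt-Nd1 and Rps19)
mito.genes <- grep("^mt-", rownames(dataset), value = TRUE)
ribo.genes <- grep("^Rp[sl]", rownames(dataset), value = TRUE)
head(mito.genes)
head(ribo.genes)
```

PercentageFeatureSet() then sums the counts of the matched genes in each cell and divides by that cell’s total counts.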

2.2.3 Plot the additional quality metrics

We can use the VlnPlot() function of the Seurat package to visualise the QC metrics.

plot0 <- VlnPlot(object = dataset, features = c("percent.mt", "percent.ribo"),pt.size = 0, cols = "blue")
plot0

Visualize how mito and ribo percentages change as a function of the number of counts.

plot1 <- FeatureScatter(dataset, feature1 = "nCount_RNA", feature2 = "percent.mt",pt.size = 2, cols = "blue")
plot2 <- FeatureScatter(dataset, feature1 = "nCount_RNA", feature2 = "percent.ribo",pt.size = 2, cols = "blue")
plot3 <- FeatureScatter(dataset, feature1 = "nCount_RNA", feature2 = "nFeature_RNA",pt.size = 2, cols = "blue")
plot_null <- ggplot() + theme_void()
(plot1 + plot2) / (plot3 + plot_null)

Question 1

What is the relationship between the total number of RNA molecules per cell (nCount) and
i) mitochondrial RNA percentage?
ii) ribosomal RNA percentage?
iii) the number of features?

2.3 Filter low quality cells

Practice 1

We do not want low-quality cells in our data. Looking at the plots above, determine which cells to get rid of.


Use subset() to fetch the cells that fit your description.

cutoff_mito = ## ENTER A VALUE HERE ##
cutoff_ribo = ## ENTER A VALUE HERE ##
dataset <- subset(x = dataset, subset = percent.mt < cutoff_mito & percent.ribo < cutoff_ribo)

2.4 Normalise

In Seurat, the standard preprocessing workflow can be replaced by a single command. However, it is good to go through each step to learn what it does. First, we will normalize the data. This is to get rid of the differences in total RNA counts between cells. In other words, we will equalize the total count number of each cell to a fixed number (e.g. 10,000 RNA molecules per cell).

dataset <- NormalizeData(object = dataset, normalization.method = "LogNormalize", scale.factor = 10000)
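To see what the LogNormalize method computes, here is a minimal base-R sketch of the same arithmetic on a made-up toy count matrix (the numbers are purely illustrative, not from the dataset): each cell’s counts are divided by its total, multiplied by the scale factor, and log-transformed with a pseudocount of 1.

```r
## Toy sketch of the LogNormalize arithmetic (values are made up)
counts <- matrix(c(1, 0, 9,
                   2, 3, 5),
                 nrow = 3,
                 dimnames = list(c("geneA", "geneB", "geneC"),
                                 c("cell1", "cell2")))
totals <- colSums(counts)                      # total RNA count per cell
norm <- log1p(t(t(counts) / totals) * 10000)   # fraction -> scale -> log(1 + x)
round(norm, 2)
```

After this step both toy cells have the same total before the log, which is exactly the equalization described above.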

2.5 Detection of variable genes across the single cells

We want to find ‘informative genes’ that explain biological differences, to use in some of the downstream applications. If a gene is expressed everywhere, it does not tell us much. However, if a gene is expressed in only a subset of cells, it will cause ‘variation’.

We can detect these genes using the FindVariableFeatures() function.

## Here, we select highly variable genes (HVGs) using mean/dispersion cutoffs for downstream analysis.
dataset <- FindVariableFeatures(object = dataset, selection.method = 'mean.var.plot', mean.cutoff = c(0.0125, 3), dispersion.cutoff = c(0.5, Inf))
length(x = VariableFeatures(object = dataset))
## [1] 8532

Identify the 10 most highly variable genes.

top10 <- head(VariableFeatures(dataset), 10)
top10
##  [1] "Hba-a1"  "Hbb-bs"  "Ptgds"   "Hbb-bt"  "Apoe"    "Npy"     "Sst"    
##  [8] "Gm26778" "Erbb2"   "Mt1"

Now visualise.

## Plot
# plot variable features with and without labels
plot1 <- VariableFeaturePlot(dataset)
plot2 <- LabelPoints(plot = plot1, points = top10, repel = TRUE)
## When using repel, set xnudge and ynudge to 0 for optimal results
plot2

Note: Dispersion indicates variation, while the red color shows ‘significantly variable’ genes.

Print the top 1000 highly variable genes.

hv.genes <- head(rownames(HVFInfo(object = dataset)), 1000)
head(hv.genes,n=100) # list the first 100
##   [1] "Cd79b"     "Tbxas1"    "Tifab"     "Itpripl2"  "Notum"     "Acvrl1"   
##   [7] "Rad54b"    "Mill2"     "Hcn2"      "Tnfaip6"   "Cldn11"    "Ermn"     
##  [13] "Gjb1"      "Car14"     "Lpar1"     "Gsn"       "Trf"       "Serpinb1a"
##  [19] "Itgb4"     "Opalin"    "Gpr37"     "Aspa"      "Mag"       "Mog"      
##  [25] "Tmem88b"   "Plekhh1"   "Dock10"    "Tspan2"    "Prr5l"     "Gjc3"     
##  [31] "Hapln2"    "Ppp1r14a"  "Mal"       "Ugt8a"     "Fa2h"      "Arsg"     
##  [37] "Phldb1"    "Gpr17"     "Sema5a"    "Kif19a"    "Cd9"       "Bcas1"    
##  [43] "Tmem163"   "Tcf7l2"    "Enpp6"     "Rassf10"   "Col9a3"    "Pak4"     
##  [49] "Rnd2"      "Myrf"      "Bfsp2"     "Il23a"     "Ndrg1"     "Fgfr2"    
##  [55] "Ttyh2"     "Anln"      "Trim59"    "Plekhg3"   "Padi2"     "Pacs2"    
##  [61] "Cdh19"     "Ms4a7"     "F13a1"     "Lyz2"      "Pf4"       "Clec4n"   
##  [67] "Ccl24"     "Clec4a2"   "Ccl12"     "Cyba"      "Cbr2"      "Cd209f"   
##  [73] "Mrc1"      "Fcer1g"    "Cd14"      "Cd37"      "Hpgds"     "Arl11"    
##  [79] "Stac"      "Hk3"       "Spi1"      "Fcrls"     "Slc2a5"    "Lcp2"     
##  [85] "Slc39a12"  "Trem2"     "Hmha1"     "Tgfbr1"    "Ext1"      "Pld4"     
##  [91] "AI662270"  "Dock2"     "Abi3"      "Lrrc25"    "Cd53"      "Cyth4"    
##  [97] "Siglech"   "Adap2os"   "Fcgr1"     "Fcgr2b"

2.6 Scale the data and get rid of the confounders

We discussed scaling (standardization) in the previous section. Here, we will only take the highly variable genes (hv.genes) to scale and use in downstream dimensionality reduction.

We can also get rid of confounding factors at this point. These factors introduce ‘technical noise’ into the data. For instance, the number of reads per cell can influence the amount of information in a cell and make it seem different from another cell with low RNA levels, even though they are similar cells.

We will use the ScaleData() function of the Seurat package. The confounding factors can be regressed out using `vars.to.regress`.

dataset <- ScaleData(object = dataset, features = hv.genes, vars.to.regress = c("percent.mt","nFeature_RNA"))
## Regressing out percent.mt, nFeature_RNA
## Centering and scaling data matrix
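Behind the scenes, the scaling step z-scores each gene across cells, so every gene ends up with mean 0 and standard deviation 1 (the regression of confounders is a separate step). A minimal sketch with made-up values for a single gene:

```r
## Toy sketch of the z-scoring performed by ScaleData (values are made up)
expr <- c(0.5, 2.0, 1.5, 0.0, 1.0)         # normalized expression of one gene
scaled <- (expr - mean(expr)) / sd(expr)   # center, then divide by sd
round(mean(scaled), 10)                    # mean is now 0
round(sd(scaled), 10)                      # sd is now 1
```

This is why highly and lowly expressed genes contribute comparably to the downstream PCA.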

Dimensionality reduction

2.7 PCA analysis

2.7.1 Perform PCA

Performing dimensionality reduction in Seurat is very simple! We first calculate the PCA.

dataset <- RunPCA(object = dataset, features = hv.genes, verbose = FALSE,npcs = 50)

2.7.2 Plot PCA results

plot1 <- DimPlot(object = dataset, reduction = 'pca',dims = c(1,2))
plot1

It is also easy to find the genes that contribute to the + and - directions of the top PCs! We can also plot their level of contribution.

print(dataset[["pca"]], dims = 1:5, nfeatures = 5) # First box
## PC_ 1 
## Positive:  Cldn5, Flt1, Ly6c1, Pglyrp1, Ly6a 
## Negative:  C1qb, C1qa, C1qc, Ctss, P2ry12 
## PC_ 2 
## Positive:  Cldn11, Mog, Tspan2, Fa2h, Ugt8a 
## Negative:  C1qa, Ctss, C1qb, Tyrobp, C1qc 
## PC_ 3 
## Positive:  Aldoc, Gpr37l1, Gja1, Slc1a3, Mfge8 
## Negative:  Mog, Cldn11, Mal, Trf, Ppp1r14a 
## PC_ 4 
## Positive:  Cntnap5a, Calb1, Lamp5, March1, Tmem59l 
## Negative:  Slc1a3, Gpr37l1, Plpp3, Aldoc, Mfge8 
## PC_ 5 
## Positive:  Calb1, Fam19a1, Lamp5, Rasgrf2, Cntnap5a 
## Negative:  Rprm, Hs3st4, Nxph3, Foxp2, Syt6
VizDimLoadings(dataset, dims = 1:2, reduction = "pca") # Second box

Question 2

Some genes contribute more to the variation than others… What do these genes and principal components tell us?

2.7.3 Visualise how genes contribute to PCs on heatmaps

We can use an integrated function of Seurat to plot heatmaps that visualize the genes driving different principal components.

PCHeatmap(dataset,dims = 1:6,ncol =2)

Plot some of these genes on PCA plots to see how their expression is distributed along different PCs. For this, we will look at the first 4 PCs, identify the genes that are highest on the + and - axes, and plot them on PC1/PC2 and PC3/PC4 plots.

The first two components:

PCHeatmap(dataset,dims = c(1,2),ncol =2)

Aldoc and Mog genes look interesting:

plot_f1 <- FeaturePlot(dataset,features = c('Aldoc','Mog'),reduction = 'pca',dims = c(1,2),cols = c('gray','blue','red'))
plot_f2 <- FeaturePlot(dataset,features = c('Aldoc','Mog'),reduction = 'pca',dims = c(3,4),cols = c('gray','blue','red'))
plot_f1 / plot_f2

Now components 3 and 4:

PCHeatmap(dataset,dims = c(3,4),ncol =2)

C1qa and Ly6c1 genes look interesting:

plot_f3 <- FeaturePlot(dataset,features = c('C1qa','Ly6c1'),reduction = 'pca',dims = c(1,2),cols = c('gray','blue','red'))
plot_f4 <- FeaturePlot(dataset,features = c('C1qa','Ly6c1'),reduction = 'pca',dims = c(3,4),cols = c('gray','blue','red'))
plot_f3 / plot_f4

Practice 2

Check principal components 10, 20, 30, 40 and 50.
- What differences do you see?
- Would you include all principal components for downstream analysis? Why/why not?


2.8 Find clusters using SNN graph construction and the Louvain algorithm

We will use a graph-based clustering algorithm discussed in the lecture.

2.8.1 SNN Graph Construction

We need to build a neighborhood graph. In this graph, each cell is a node, and similarity in high-dimensional space defines the edges.

One could say that cells closest to each other reside in a neighborhood.

dataset <- FindNeighbors(object = dataset, dims = 1:20) 
## Computing nearest neighbor graph
## Computing SNN
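The idea behind the neighbour graph can be sketched with plain pairwise distances on made-up PC coordinates: cells with small distances to each other end up connected in the graph.

```r
## Toy sketch of the neighbourhood idea (coordinates are made up)
pcs <- matrix(c(0.0, 0.0,
                0.1, 0.2,
                5.0, 5.0),
              ncol = 2, byrow = TRUE,
              dimnames = list(c("cellA", "cellB", "cellC"), c("PC1", "PC2")))
as.matrix(dist(pcs))  # cellA and cellB are close; cellC is far from both
```

FindNeighbors() does this in the 20-dimensional PC space and then keeps, for each cell, only its nearest neighbours (the shared-nearest-neighbour, SNN, graph).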

2.8.2 Find clusters using the Louvain algorithm

dataset <- FindClusters(object = dataset, resolution = 0.6) # changing the resolution will change the size/number of clusters! c(0.6, 0.8, 1)
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
## 
## Number of nodes: 6658
## Number of edges: 246566
## 
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.9164
## Number of communities: 18
## Elapsed time: 0 seconds
VlnPlot(object = dataset, features = c("nCount_RNA", "nFeature_RNA"))
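The cluster assignments become the active identity of the object; a quick, optional way to inspect the result is to count the cells per cluster.

```r
## Count how many cells fall into each of the clusters found above
table(Idents(dataset))
```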

2.9 UMAP and t-SNE analysis

We will use the top PCs to calculate the UMAP and t-SNE coordinates. You can change the number of PCs based on your observations in Practice 2.

dataset <- RunUMAP(object = dataset, reduction = "pca", dims = 1:20, verbose = FALSE)
## Warning: The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
## To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
## This message will be shown once per session
dataset <- RunTSNE(object = dataset, dims = 1:20, verbose = FALSE)

Practice 3

Do you want to see them? Adjust the DimPlot() function to show the PCA, UMAP and t-SNE results!
Include the cluster colors to visualize the clusters.
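As a starting point (one possible sketch, not the only solution), DimPlot() takes a `reduction` argument matching the names of the embeddings computed above:

```r
## One possible sketch: the three embeddings side by side,
## coloured by cluster (the default grouping after FindClusters)
plot_pca  <- DimPlot(dataset, reduction = "pca")
plot_umap <- DimPlot(dataset, reduction = "umap")
plot_tsne <- DimPlot(dataset, reduction = "tsne")
plot_pca + plot_umap + plot_tsne
```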

Question 3

How do the clusters distribute on the different plots?
Why is there a difference?

Practice 4

Identify cell types!

Use Google or PubMed to find marker genes for neurons, inhibitory neurons, astrocytes and oligodendrocytes.

Hint: The following papers have marker genes for relevant cell types:

Plot the expression of these marker genes using the VlnPlot() function.

Practice 5

Visualise the cell type annotation of the authors!

In this specific case, the authors have already annotated the different cell types. The information is stored in the ‘Class’ and ‘Subclass’ columns of the meta.data of the Seurat object, located at dataset@meta.data.
Plot the different cell types. You can use DimPlot() to plot a UMAP, just like above, and show the information in the meta.data using the ‘group.by’ option.
What do you see? How do these compare to the clusters that you have identified?
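One possible sketch (assuming the ‘Class’ column is present in the metadata, as described above):

```r
## Colour the UMAP by the authors' annotation stored in the metadata
DimPlot(dataset, reduction = "umap", group.by = "Class")
```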

END OF THE PRACTICAL!

The following parts are extras. If you have time, please go through them.

3. Example questions

The following is a question from last year. We didn’t go into the details of k-means this year, so the question may look out of context, but it will give you an idea of what can be expected. I have indicated the answers.

Question 1

Please list four important facts about the k-means algorithm on the following topics: i) What is it used for? > A: For classification of data into groups/clusters

  1. Please explain the important steps > A: Determine the expected number of clusters. Take random points in the data and calculate the distance of each data point to these random points to assign clusters. Then calculate the center of each cluster. Finally, repeat this process until there is no change in the central points, meaning that stability is reached.

  2. Name one of the drawbacks > A: It needs an estimate of the number of clusters beforehand. It cannot work on all different ‘shapes’ of data. It cannot find outliers. It does not show how similar the clusters are.

  3. If you run the k-means algorithm on the same data two different times, do you always get the same results? Why/why not? > A: No. The process is stochastic: it starts randomly and is repeated many times until stability is reached. The final results will, in most cases, be different.
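The steps above can be sketched with base R’s kmeans() on made-up two-dimensional data (the toy values and the seed are only there to make the random start reproducible):

```r
## Toy sketch of k-means: two made-up point clouds, k chosen beforehand
set.seed(1)                                    # the start is random
toy <- rbind(matrix(rnorm(50, mean = 0), ncol = 2),
             matrix(rnorm(50, mean = 5), ncol = 2))
fit <- kmeans(toy, centers = 2, nstart = 10)   # k = 2 must be given up front
fit$centers                                    # final cluster centres
table(fit$cluster)                             # cluster sizes
```

Running it again without the seed can give different cluster labels, which illustrates the stochasticity mentioned in answer 3.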


Here is another question for you to think about:

Question 2

Which step of the single cell RNA sequencing analysis aims at getting rid of the differences in the number of reads between cells (e.g. a cell with 1000 features and another with 5000 features)?

  1. Scaling
  2. Normalization
  3. Regression
  4. Dimensionality reduction
  5. Clustering

A: I won’t provide the answer for this one.

4. EXTRA / NOT TO BE GRADED - post practical

Follow this part if there is time left after the peer-group discussion.

4.1 Discuss UMAP versus t-SNE

> In theory, UMAP is very similar to t-SNE. Both:

  • are machine learning algorithms
  • are stochastic (or random)
  • try to show the relationship between cells taking all of the gene expression into account
  • are good at finding and visualizing patterns in the data

> They differ on 3 major points:

  • t-SNE is better at preserving local relationships, while UMAP is better at preserving the global structure of the data.
  • UMAP is faster and more efficient.
  • UMAP is also easier to optimize. t-SNE can, in theory, yield very similar results to UMAP, but this needs ‘tweaking’ of quite a few parameters.

4.2 For the curious, some more details

NOTE! THIS IS VERY HIGH LEVEL

You are not expected to follow/understand this part, and none of it will be in the exam. This is purely for people who are curious about the details of the techniques.

4.2.1 t-SNE (t-distributed stochastic neighbor embedding)

A machine learning algorithm that places cells that are similar in high-dimensional space (e.g. with respect to 20,000 genes/dimensions) close to each other in the lower dimension.

More specifically, t-SNE is a simple machine learning algorithm that minimizes the divergence between two distributions of pairwise similarities: those of the input objects and those of the corresponding low-dimensional points in the embedding.

The formula looks like this:

Source: https://towardsdatascience.com/how-exactly-umap-works-13e3040e1668

You definitely don’t need to know this! But this is all t-SNE is: 4 steps repeated again and again. The following plot shows the 1000 iterations that the algorithm performs. Note that with each iteration the clusters become better separated. However, it is still hard to find where exactly to place some of the cells, which continue to jiggle even at the end.

Source: (https://www.oreilly.com/content/an-illustrated-introduction-to-the-t-sne-algorithm/)

4.2.2 UMAP (Uniform Manifold Approximation and Projection)

UMAP is a machine learning algorithm that uses ‘manifolds’ to approximate the relationships between the data points in high-dimensional space within a lower dimension. Manifolds are built on simplices. Here are some low-dimensional examples.

There is also a formula involved in the calculations, which we won’t show. The algorithm takes a ‘different approach’ than t-SNE to find the relationships between cells (or other data points/samples).

You can find more information at: https://umap-learn.readthedocs.io/en/latest/how_umap_works.html

Summary

Is UMAP better than t-SNE, or vice versa? Neither, really. UMAP has more advantages when looking at ‘global similarities’, while t-SNE performs well when looking at ‘local interactions’. Both are extremely good and popular methods for visualising high-dimensional data.